Abstract

In this paper, we consider a generalized longest common subsequence problem, the string-excluding
constrained LCS (STR-EC-LCS) problem. For two input sequences $X$ and $Y$ of lengths $n$ and $m$, and a
constraint string $P$ of length $r$, the problem is to find a longest common subsequence $Z$ of $X$ and $Y$
that excludes $P$ as a substring. The problem and its solution were first proposed by Chen and Chao [1],
but we found that their algorithm cannot solve the problem correctly. A new dynamic programming
solution for the STR-EC-LCS problem is then presented in this paper. The correctness of the new
algorithm is proved. The time complexity of the new algorithm is $O(nmr)$.

1 Introduction 

In this paper, we consider a generalized longest common subsequence problem. The longest common
subsequence (LCS) problem is a well-known measurement for computing the similarity of two strings.
It can be widely applied in diverse areas, such as file comparison, pattern matching and computational
biology.

A sequence is a string of characters over an alphabet $\Sigma$. A subsequence of a sequence $X$ is obtained
by deleting zero or more characters from $X$ (not necessarily contiguous). A substring of a sequence $X$ is a
subsequence of successive characters within $X$.

For a given sequence $X = x_1 x_2 \cdots x_n$ of length $n$, the $i$th character of $X$ is denoted as
$x_i \in \Sigma$ for any $i = 1, \cdots, n$. A substring of $X$ from position $i$ to $j$ can be denoted as
$X[i:j] = x_i x_{i+1} \cdots x_j$. A substring $X[i:j] = x_i x_{i+1} \cdots x_j$ is called a prefix or a
suffix of $X$ if $i = 1$ or $j = n$, respectively.

Given two sequences $X$ and $Y$, the longest common subsequence (LCS) problem is to find a common
subsequence of $X$ and $Y$ whose length is the longest among all common subsequences of the two given
sequences.

For some biological applications some constraints must be applied to the LCS problem. These kinds of
variants of the LCS problem are called the constrained LCS (CLCS) problem [10]. Recently, Chen and Chao [1]
proposed more generalized forms of the CLCS problem, the generalized constrained longest common
subsequence (GC-LCS) problems. For the two input sequences $X$ and $Y$ of lengths $n$ and $m$, respectively,
and a constraint string $P$ of length $r$, the GC-LCS problem is a set of four problems which are to find the
LCS of $X$ and $Y$ including/excluding $P$ as a subsequence/substring, respectively. The four generalized
constrained LCS problems are summarized in Table 1.



Table 1: The GC-LCS problems

Problem      Input          Output
SEQ-IC-LCS   X, Y, and P    The longest common subsequence of X and Y including P as a subsequence
STR-IC-LCS   X, Y, and P    The longest common subsequence of X and Y including P as a substring
SEQ-EC-LCS   X, Y, and P    The longest common subsequence of X and Y excluding P as a subsequence
STR-EC-LCS   X, Y, and P    The longest common subsequence of X and Y excluding P as a substring

We will discuss the STR-EC-LCS problem in this paper. We have noticed that a previously proposed
dynamic programming algorithm for the STR-EC-LCS problem [1] cannot correctly solve the problem. A
new dynamic programming solution for the STR-EC-LCS problem is then presented in this paper. The
correctness of the new algorithm is proved. The time complexity of the new algorithm is $O(nmr)$.

The organization of the paper is as follows.

In the following 4 sections we describe our presented dynamic programming algorithm for the
STR-EC-LCS problem.

In section 2 we review the dynamic programming algorithm for the STR-EC-LCS problem proposed by
Chen and Chao [1]. We point out that their algorithm fails on a simple counterexample. In section
3 we give a new dynamic programming solution for the STR-EC-LCS problem with time complexity $O(nmr)$
from a different point of view. In section 4 we discuss the issues involved in implementing the algorithm
efficiently. Some concluding remarks are in section 5.

2 A Proposed Dynamic Programming Algorithm 

In this section, we will focus on the STR-EC-LCS problem and its solution proposed previously by Chen
and Chao [1]. As noted in Table 1, for the two input sequences $X$ and $Y$ of lengths $n$ and $m$, and a
constraint string $P$ of length $r$, the STR-EC-LCS problem is to find an LCS $Z$ of $X$ and $Y$ excluding
$P$ as a substring.

Let $L(i,j,k)$ denote the length of an LCS of $X[1:i]$ and $Y[1:j]$ excluding $P[1:k]$ as a substring. Chen
and Chao gave a recursive formula (1) for computing $L(i,j,k)$ as follows.

$$
L(i,j,k) = \begin{cases}
L(i-1,j-1,k) & \text{if } k = 1 \text{ and } x_i = y_j = p_k, \\
1 + \max\{L(i-1,j-1,k-1),\, L(i-1,j-1,k)\} & \text{if } k \ge 2 \text{ and } x_i = y_j = p_k, \\
1 + L(i-1,j-1,k) & \text{if } x_i = y_j \text{ and } (k = 0, \text{ or } k > 0 \text{ and } x_i \ne p_k), \\
\max\{L(i-1,j,k),\, L(i,j-1,k)\} & \text{if } x_i \ne y_j.
\end{cases} \tag{1}
$$

The boundary conditions of this recursive formula are $L(i,0,k) = L(0,j,k) = 0$ for any $0 \le i \le n$,
$0 \le j \le m$, and $0 \le k \le r$.

The correctness of the recursive formula (1) was based on Theorem 3 of their paper [1] as follows.

Theorem 1 (Chen and Chao 2011) Let $S_{i,j,k}$ denote the set of all LCSs of $X[1:i]$ and $Y[1:j]$ excluding
$P[1:k]$ as a substring. If $Z[1:l] \in S_{i,j,k}$, the following conditions hold:

(1) If $x_i = y_j = p_k$ and $k = 1$, then $z_l \ne x_i$ and $Z[1:l] \in S_{i-1,j-1,k}$.

(2) If $x_i = y_j = p_k$ and $k \ge 2$, then $z_l = x_i = y_j = p_k$ and $z_{l-1} = p_{k-1}$ implies $Z[1:l-1] \in S_{i-1,j-1,k-1}$.

(3) If $x_i = y_j = p_k$ and $k \ge 2$, then $z_l = x_i = y_j = p_k$ and $z_{l-1} \ne p_{k-1}$ implies $Z[1:l-1] \in S_{i-1,j-1,k}$.

(4) If $x_i = y_j = p_k$ and $k \ge 2$, then $z_l \ne x_i$ implies $Z[1:l] \in S_{i-1,j-1,k}$.

(5) If $x_i = y_j$ and $x_i \ne p_k$, then $z_l = x_i = y_j$ and $Z[1:l-1] \in S_{i-1,j-1,k}$.

(6) If $x_i \ne y_j$, then $z_l \ne x_i$ implies $Z[1:l] \in S_{i-1,j,k}$.

(7) If $x_i \ne y_j$, then $z_l \ne y_j$ implies $Z[1:l] \in S_{i,j-1,k}$.

Since a common subsequence of $X[1:i]$ and $Y[1:j]$ excluding $P[1:k-1]$ as a substring is also a common
subsequence of $X[1:i]$ and $Y[1:j]$ excluding $P[1:k]$ as a substring, by the definition of $L(i,j,k)$, we know
that $L(i,j,k) \ge L(i,j,k-1)$ is always true. Therefore, the recursive formula (1) can be further reduced to
the recursive formula (2).

$$
L(i,j,k) = \begin{cases}
L(i-1,j-1,k) & \text{if } k = 1 \text{ and } x_i = y_j = p_k, \\
1 + L(i-1,j-1,k) & \text{if } k \ge 2 \text{ and } x_i = y_j = p_k, \\
1 + L(i-1,j-1,k) & \text{if } x_i = y_j \text{ and } (k = 0, \text{ or } k > 0 \text{ and } x_i \ne p_k), \\
\max\{L(i-1,j,k),\, L(i,j-1,k)\} & \text{if } x_i \ne y_j.
\end{cases} \tag{2}
$$

Furthermore, the most important thing is that the above theorem was only stated, without a strict
proof. Therefore, the correctness of the proposed algorithm cannot be guaranteed. For example, if
$X = abbb$, $Y = aab$ and $P = ab$, the values of $L(i,j,k)$, $1 \le i \le 4$, $1 \le j \le 3$, $0 \le k \le 2$,
computed by recursive formulas (1) and (2) are listed in Table 2.



Table 2: L(i,j,k) computed by recursive formulas (1) and (2); the three entries in each cell are for j = 1, 2, 3

         k = 0    k = 1    k = 2
i = 1    1 1 1    0 0 0    1 1 1
i = 2    1 1 2    0 0 1    1 1 2
i = 3    1 1 2    0 0 1    1 1 2
i = 4    1 1 2    0 0 1    1 1 2

From Table 2 we know that the final answer is $L(4,3,2) = 2$, which is computed by the formula
$L(4,3,2) = 1 + L(3,2,2)$ since in this case $k \ge 2$ and $x_4 = y_3 = p_2 = b$. But this is a wrong answer,
since the correct answer should be 1: the only common subsequence of $X$ and $Y$ of length 2 is $ab$,
which contains $P$.
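The counterexample can be checked mechanically. The following Python sketch is illustrative only: `brute_force_str_ec_lcs` enumerates all common subsequences (exponential, for tiny inputs), and `chen_chao_formula_2` evaluates recursive formula (2) under the assumption that every matching case other than $k = 1$ with $x_i = y_j = p_1$ adds one to $L(i-1,j-1,k)$, which is how $L(4,3,2)$ is evaluated above. Both function names are hypothetical.

```python
from itertools import combinations


def subsequences(s):
    """Return the set of all subsequences of s (exponential; tiny inputs only)."""
    return {"".join(c) for n in range(len(s) + 1) for c in combinations(s, n)}


def brute_force_str_ec_lcs(x, y, p):
    """Length of a longest common subsequence of x and y avoiding p as a substring."""
    common = subsequences(x) & subsequences(y)
    return max(len(z) for z in common if p not in z)


def chen_chao_formula_2(x, y, p):
    """Evaluate L(n, m, r) with the reduced formula (2), as assumed in the lead-in."""
    n, m, r = len(x), len(y), len(p)
    # L[i][j][k]; index 0 in i or j holds the zero boundary conditions.
    L = [[[0] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            xi, yj = x[i - 1], y[j - 1]
            for k in range(r + 1):
                if xi != yj:
                    L[i][j][k] = max(L[i - 1][j][k], L[i][j - 1][k])
                elif k == 1 and xi == p[0]:
                    L[i][j][k] = L[i - 1][j - 1][k]
                else:
                    # The two remaining cases of (2) both give 1 + L(i-1, j-1, k).
                    L[i][j][k] = 1 + L[i - 1][j - 1][k]
    return L[n][m][r]


if __name__ == "__main__":
    print(chen_chao_formula_2("abbb", "aab", "ab"))     # value produced by formula (2)
    print(brute_force_str_ec_lcs("abbb", "aab", "ab"))  # value from exhaustive search
```

On $X = abbb$, $Y = aab$, $P = ab$ the two functions disagree, which is exactly the discrepancy discussed in the text.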

We have tried to modify the recursive formula (1) or (2) into a correct one, but failed.

In the next section, we will investigate the problem in a different way and finally present a correct
dynamic programming solution for the STR-EC-LCS problem.



3 Our New Dynamic Programming Solution 

For the two input sequences $X = x_1 x_2 \cdots x_n$ and $Y = y_1 y_2 \cdots y_m$ of lengths $n$ and $m$,
respectively, and a constraint string $P = p_1 p_2 \cdots p_r$ of length $r$, we want to find an LCS of $X$
and $Y$ excluding $P$ as a substring.

In the description of our new algorithm, a function $\sigma$ will be mentioned frequently. For any string $S$
and a fixed constraint string $P$, the length of the longest suffix of $S$ that is also a prefix of $P$ is
denoted by the function $\sigma(S)$.

The symbol $\oplus$ is used to denote string concatenation.

For example, if $P = aaba$ and $S = aabaaab$, then the substring $aab$ is the longest suffix of $S$ that is
also a prefix of $P$, and therefore $\sigma(S) = 3$.

It is readily seen that $S \oplus P = aabaaabaaba$.
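As a quick illustration of the definition, the function $\sigma$ can be evaluated directly by trying candidate lengths from longest to shortest. This is a naive sketch, not the efficient method developed later; the name `sigma` and the explicit parameter `p` are choices made here for illustration.

```python
def sigma(s, p):
    """Length of the longest suffix of s that is also a prefix of p."""
    for length in range(min(len(s), len(p)), 0, -1):
        if s.endswith(p[:length]):
            return length
    return 0


if __name__ == "__main__":
    print(sigma("aabaaab", "aaba"))   # the example from the text
    print("aabaaab" + "aaba")         # string concatenation S (+) P
```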

Let $Z(i,j,k)$ denote the set of all longest common subsequences $z$ of $X[1:i]$ and $Y[1:j]$ that exclude
$P$ as a substring and satisfy $\sigma(z) = k$. The length of an LCS in $Z(i,j,k)$ is denoted as $f(i,j,k)$.

If we can compute $f(i,j,k)$ for any $1 \le i \le n$, $1 \le j \le m$, and $0 \le k < r$ efficiently, then the
length of an LCS of $X$ and $Y$ excluding $P$ as a substring must be $\max_{0 \le t < r} \{f(n,m,t)\}$.

We can give a recursive formula for computing $f(i,j,k)$ by the following theorem.

Theorem 2 For the two input sequences $X = x_1 x_2 \cdots x_n$ and $Y = y_1 y_2 \cdots y_m$ of lengths $n$
and $m$, respectively, and a constraint string $P = p_1 p_2 \cdots p_r$ of length $r$, let $Z(i,j,k)$ denote
the set of all LCSs of $X[1:i]$ and $Y[1:j]$ excluding $P$ as a substring and $\sigma(z) = k$ for each
$z \in Z(i,j,k)$. The length of an LCS in $Z(i,j,k)$ is denoted as $f(i,j,k)$.

For any $1 \le i \le n$, $1 \le j \le m$, and $0 \le k < r$, $f(i,j,k)$ can be computed by the following
recursive formula (3).

$$
f(i,j,k) = \begin{cases}
\max\{f(i-1,j,k),\, f(i,j-1,k)\} & \text{if } x_i \ne y_j, \\
\max\Bigl\{f(i-1,j-1,k),\; 1 + \max\limits_{0 \le t < r} \{f(i-1,j-1,t) \mid \sigma(P[1:t] \oplus x_i) = k\}\Bigr\} & \text{if } x_i = y_j.
\end{cases} \tag{3}
$$

The boundary conditions of this recursive formula are $f(i,0,k) = f(0,j,k) = 0$ for any $0 \le i \le n$,
$0 \le j \le m$, and $0 \le k < r$.

Proof. 

For any $1 \le i \le n$, $1 \le j \le m$, and $0 \le k < r$, suppose $f(i,j,k) = t$ and
$z = z_1, \cdots, z_t \in Z(i,j,k)$.

First of all, we notice that for each pair $(i', j')$, $1 \le i' \le n$, $1 \le j' \le m$, such that $i' \le i$
and $j' \le j$, we have $f(i',j',k) \le f(i,j,k)$, since a common subsequence $z$ of $X[1:i']$ and $Y[1:j']$
excluding $P$ as a substring with $\sigma(z) = k$ is also a common subsequence of $X[1:i]$ and $Y[1:j]$
excluding $P$ as a substring with $\sigma(z) = k$.

(1) In the case of $x_i \ne y_j$, we have $x_i \ne z_t$ or $y_j \ne z_t$.

(1.1) If $x_i \ne z_t$, then $z = z_1, \cdots, z_t$ is a common subsequence of $X[1:i-1]$ and $Y[1:j]$
excluding $P$ as a substring and $\sigma(z_1, \cdots, z_t) = k$, and so $f(i-1,j,k) \ge t$. On the other hand,
$f(i-1,j,k) \le f(i,j,k) = t$. Therefore, in this case we have $f(i,j,k) = f(i-1,j,k)$.

(1.2) If $y_j \ne z_t$, then we can prove similarly that in this case, $f(i,j,k) = f(i,j-1,k)$.

Combining the two subcases we conclude that in the case of $x_i \ne y_j$, we have
$f(i,j,k) = \max\{f(i-1,j,k),\, f(i,j-1,k)\}$.

(2) In the case of $x_i = y_j$, there are also two cases to be distinguished.

(2.1) If $x_i = y_j \ne z_t$, then $z = z_1, \cdots, z_t$ is also a common subsequence of $X[1:i-1]$ and
$Y[1:j-1]$ excluding $P$ as a substring and $\sigma(z_1, \cdots, z_t) = k$, and so $f(i-1,j-1,k) \ge t$. On the
other hand, $f(i-1,j-1,k) \le f(i,j,k) = t$. Therefore, in this case we have $f(i,j,k) = f(i-1,j-1,k)$.

(2.2) If $x_i = y_j = z_t$, then $f(i,j,k) = t > 0$ and $z = z_1, \cdots, z_t$ is an LCS of $X[1:i]$ and
$Y[1:j]$ excluding $P$ as a substring and $\sigma(z_1, \cdots, z_t) = k$, and thus $z_1, \cdots, z_{t-1}$ is a
common subsequence of $X[1:i-1]$ and $Y[1:j-1]$ excluding $P$ as a substring.

Let $\sigma(z_1, \cdots, z_{t-1}) = q$ and $f(i-1,j-1,q) = s$. Then $z_1, \cdots, z_{t-1}$ is a common
subsequence of $X[1:i-1]$ and $Y[1:j-1]$ excluding $P$ as a substring and $\sigma(z_1, \cdots, z_{t-1}) = q$.
Therefore, we have

$$f(i-1,j-1,q) = s \ge t - 1. \tag{4}$$

Let $v = v_1, \cdots, v_s \in Z(i-1,j-1,q)$ be an LCS of $X[1:i-1]$ and $Y[1:j-1]$ excluding $P$ as a
substring with $\sigma(v_1, \cdots, v_s) = q$. Then
$\sigma((v_1, \cdots, v_s) \oplus x_i) = \sigma(P[1:q] \oplus x_i) = k$, and thus $(v_1, \cdots, v_s) \oplus x_i$
is a common subsequence of $X[1:i]$ and $Y[1:j]$ excluding $P$ as a substring and
$\sigma((v_1, \cdots, v_s) \oplus x_i) = k$.

Therefore,

$$f(i,j,k) = t \ge s + 1. \tag{5}$$

Combining (4) and (5) we have $s = t - 1$. Therefore, $z_1, \cdots, z_{t-1}$ is an LCS of $X[1:i-1]$ and
$Y[1:j-1]$ excluding $P$ as a substring and $\sigma(z_1, \cdots, z_{t-1}) = q$.
In other words,

$$f(i,j,k) \le 1 + \max_{0 \le q < r} \{f(i-1,j-1,q) \mid \sigma(P[1:q] \oplus x_i) = k\}. \tag{6}$$

On the other hand, for any $0 \le q < r$, if $f(i-1,j-1,q) = s$ and $\sigma(P[1:q] \oplus x_i) = k$, then for
any $v = v_1, \cdots, v_s \in Z(i-1,j-1,q)$, $v \oplus x_i$ is a common subsequence of $X[1:i]$ and $Y[1:j]$
and $\sigma(v \oplus x_i) = k$. Since $v$ excludes $P$ as a substring and $\sigma(v \oplus x_i) = k < r$,
$v \oplus x_i$ is a common subsequence of $X[1:i]$ and $Y[1:j]$ excluding $P$ as a substring. Therefore,
$f(i,j,k) = t \ge 1 + s = 1 + f(i-1,j-1,q)$, and so we conclude that

$$f(i,j,k) \ge 1 + \max_{0 \le q < r} \{f(i-1,j-1,q) \mid \sigma(P[1:q] \oplus x_i) = k\}. \tag{7}$$

Combining (6) and (7) we have, in this case,

$$f(i,j,k) = 1 + \max_{0 \le q < r} \{f(i-1,j-1,q) \mid \sigma(P[1:q] \oplus x_i) = k\}. \tag{8}$$

Combining the two subcases in the case of $x_i = y_j$, we conclude that the recursive formula (3) is correct
for the case $x_i = y_j$.

The proof is complete. ■ 

4 The Implementation of the Algorithm

According to Theorem 2, our new algorithm for computing $f(i,j,k)$ is a standard 3-dimensional dynamic
programming algorithm. By the recursive formula (3), the new dynamic programming algorithm for computing
$f(i,j,k)$ can be implemented as the following Algorithm 1.



Algorithm 1 STR-EC-LCS

Input: Strings $X = x_1 \cdots x_n$, $Y = y_1 \cdots y_m$ of lengths $n$ and $m$, respectively, and a
constraint string $P = p_1 \cdots p_r$ of length $r$
Output: The length of an LCS of $X$ and $Y$ excluding $P$ as a substring

 1: for all $i, j, k$, $0 \le i \le n$, $0 \le j \le m$, and $0 \le k \le r$ do
 2:     $f(i,0,k) \leftarrow 0$, $f(0,j,k) \leftarrow 0$ {boundary condition}
 3: end for
 4: for $i = 1$ to $n$ do
 5:     for $j = 1$ to $m$ do
 6:         for $k = 0$ to $r$ do
 7:             if $x_i \ne y_j$ then
 8:                 $f(i,j,k) \leftarrow \max\{f(i-1,j,k),\, f(i,j-1,k)\}$
 9:             else
10:                 $u \leftarrow \max_{0 \le t < r} \{f(i-1,j-1,t) \mid \sigma(P[1:t] \oplus x_i) = k\}$
11:                 $f(i,j,k) \leftarrow \max\{f(i-1,j-1,k),\, 1 + u\}$
12:             end if
13:         end for
14:     end for
15: end for
16: return $\max_{0 \le t < r} \{f(n,m,t)\}$
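A direct transcription of recursive formula (3) into Python may help make the indices concrete. This is a minimal sketch under two assumptions: $P$ is non-empty, and a state $k$ with no admissible predecessor $t$ simply keeps the value $f(i-1,j-1,k)$. The function names `sigma` and `str_ec_lcs_direct` are hypothetical, and $\sigma$ is evaluated naively here, so this version is slower than the $O(nmr)$ bound derived later in this section.

```python
def sigma(s, p):
    """Length of the longest suffix of s that is also a prefix of p (naive)."""
    for length in range(min(len(s), len(p)), 0, -1):
        if s.endswith(p[:length]):
            return length
    return 0


def str_ec_lcs_direct(x, y, p):
    """Length of an LCS of x and y excluding p as a substring, via formula (3)."""
    n, m, r = len(x), len(y), len(p)
    # f[i][j][k] for 0 <= k < r; index 0 in i or j is the zero boundary condition.
    f = [[[0] * r for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r):
                if x[i - 1] != y[j - 1]:
                    f[i][j][k] = max(f[i - 1][j][k], f[i][j - 1][k])
                else:
                    # Predecessor states t whose extension by x_i lands in state k.
                    candidates = [f[i - 1][j - 1][t] for t in range(r)
                                  if sigma(p[:t] + x[i - 1], p) == k]
                    best = f[i - 1][j - 1][k]
                    if candidates:
                        best = max(best, 1 + max(candidates))
                    f[i][j][k] = best
    return max(f[n][m])


if __name__ == "__main__":
    print(str_ec_lcs_direct("abbb", "aab", "ab"))
```

Applied to the counterexample of section 2 ($X = abbb$, $Y = aab$, $P = ab$), this sketch returns 1.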



To implement our new algorithm efficiently, the most important thing is to compute
$\sigma(P[1:k] \oplus x_i)$ for each $0 \le k < r$ and $x_i$, $1 \le i \le n$, in line 10 efficiently.

It is obvious that $\sigma(P[1:k] \oplus x_i) = k + 1$ for the case of $x_i = p_{k+1}$. It is more complex to
compute $\sigma(P[1:k] \oplus x_i)$ for the case of $x_i \ne p_{k+1}$. In this case the length of the matched
prefix of $P$ has to be shortened to the largest $t < k$ such that $p_{k-t+1} \cdots p_k = p_1 \cdots p_t$ and
$x_i = p_{t+1}$. Therefore, in this case, $\sigma(P[1:k] \oplus x_i) = t + 1$; if no such $t$ exists, the
value is 0.

This computation is very similar to the computation of the prefix function in the KMP algorithm for
solving the string matching problem [3], [7].

For a given string $S = s_1 \cdots s_n$, the prefix function $kmp(i)$ denotes the length of the longest prefix
of $s_1 \cdots s_{i-1}$ that matches a suffix of $s_1 \cdots s_i$. For example, if $S = ababaa$, then
$kmp(1), \cdots, kmp(6) = 0, 0, 1, 2, 3, 1$.
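The definition of the prefix function can be sketched in Python as follows. The function name `prefix_function` is hypothetical, and the array is padded with a sentinel entry `kmp[0] = -1`, a convention assumed here so that the fallback loop terminates without a special case.

```python
def prefix_function(p):
    """Return kmp[0..len(p)], where kmp[0] = -1 is a sentinel and kmp[i]
    is the length of the longest proper prefix of p[:i] that is also its suffix."""
    r = len(p)
    kmp = [-1] * (r + 1)
    for i in range(1, r + 1):
        k = kmp[i - 1]
        # Fall back through shorter borders until p_i extends one of them.
        while k >= 0 and p[k] != p[i - 1]:
            k = kmp[k]
        kmp[i] = k + 1
    return kmp


if __name__ == "__main__":
    print(prefix_function("ababaa")[1:])  # values kmp(1), ..., kmp(6)
```

Each iteration starts from the previous value `kmp[i - 1]`, so the total number of fallback steps is bounded by the number of increments, which gives linear time overall.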

For the constraint string $P = p_1 \cdots p_r$ of length $r$, its prefix function $kmp$ can be pre-computed
in $O(r)$ time by Algorithm 2.

With this pre-computed prefix function $kmp$, the function $\sigma(P[1:k] \oplus ch)$ for each character
$ch \in \Sigma$ and $0 \le k < r$ can be computed by Algorithm 3.

Then, we can compute an index $t^*$ such that

$$f(i-1,j-1,t^*) = \max_{0 \le t < r} \{f(i-1,j-1,t) \mid \sigma(P[1:t] \oplus x_i) = k\}$$

in line 10 of Algorithm 1 by the following Algorithm 4.
Then the value of $u$ in line 10 of Algorithm 1 must be

$$u = f(i-1,j-1,t^*) = f(i-1,j-1,max\sigma(i,j,k)).$$

Algorithm 2 Prefix Function

Input: String $P = p_1 \cdots p_r$
Output: The prefix function $kmp$ of $P$

$kmp(0) \leftarrow -1$
for $i = 1$ to $r$ do
    $k \leftarrow kmp(i-1)$
    while $k \ge 0$ and $p_{k+1} \ne p_i$ do
        $k \leftarrow kmp(k)$
    end while
    $k \leftarrow k + 1$
    $kmp(i) \leftarrow k$
end for

Algorithm 3 $\sigma(k, ch)$

Input: String $P = p_1 \cdots p_r$, integer $k$ and character $ch$
Output: $\sigma(P[1:k] \oplus ch)$

while $k \ge 0$ and $p_{k+1} \ne ch$ do
    $k \leftarrow kmp(k)$
end while
return $k + 1$

Algorithm 4 $max\sigma(i,j,k)$

Input: Integers $i, j, k$
Output: An index $t^*$ such that

$$f(i-1,j-1,t^*) = \max_{0 \le t < r} \{f(i-1,j-1,t) \mid \sigma(P[1:t] \oplus x_i) = k\}$$

$tmp \leftarrow -1$, $t^* \leftarrow -1$
for $t = 0$ to $r - 1$ do
    if $\sigma(t, x_i) = k$ and $f(i-1,j-1,t) > tmp$ then
        $tmp \leftarrow f(i-1,j-1,t)$, $t^* \leftarrow t$
    end if
end for
return $t^*$

We can improve the efficiency of the above algorithms further in the following two points.

First, we can pre-compute a table $\lambda$ of the function $\sigma(P[1:k] \oplus ch)$ for each character
$ch \in \Sigma$ and $0 \le k < r$ to speed up the computation of $max\sigma(i,j,k)$.



Algorithm 5 $\lambda(0 : r-1, ch \in \Sigma)$

Input: String $P = p_1 \cdots p_r$, alphabet $\Sigma$
Output: A table $\lambda$

for all $a \in \Sigma$ and $a \ne p_1$ do
    $\lambda(0, a) \leftarrow 0$
end for
$\lambda(0, p_1) \leftarrow 1$
for $t = 1$ to $r - 1$ do
    for all $a \in \Sigma$ do
        if $a = p_{t+1}$ then
            $\lambda(t, a) \leftarrow t + 1$
        else
            $\lambda(t, a) \leftarrow \lambda(kmp(t), a)$
        end if
    end for
end for



The time cost of the above preprocessing algorithm is obviously $O(r|\Sigma|)$. By using this pre-computed
table $\lambda$, the value of function $\sigma(P[1:k] \oplus ch)$ for each character $ch \in \Sigma$ and
$0 \le k < r$ can be computed readily in $O(1)$ time.
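The table construction can be sketched in Python as below. The names `prefix_function` and `transition_table` are hypothetical; the table is stored as a list of dictionaries indexed by state and then by character, and $P$ is assumed to be non-empty.

```python
def prefix_function(p):
    """kmp[0..len(p)] with sentinel kmp[0] = -1."""
    kmp = [-1] * (len(p) + 1)
    for i in range(1, len(p) + 1):
        k = kmp[i - 1]
        while k >= 0 and p[k] != p[i - 1]:
            k = kmp[k]
        kmp[i] = k + 1
    return kmp


def transition_table(p, alphabet):
    """lam[t][a] = sigma(P[1:t] + a) for 0 <= t < len(p) and each a in alphabet."""
    r = len(p)
    kmp = prefix_function(p)
    lam = [{} for _ in range(r)]
    for a in alphabet:
        lam[0][a] = 1 if a == p[0] else 0
    for t in range(1, r):
        for a in alphabet:
            # Either a extends the current match, or we reuse the row of the
            # longest border of P[1:t], which has already been filled in.
            lam[t][a] = t + 1 if a == p[t] else lam[kmp[t]][a]
    return lam


if __name__ == "__main__":
    for row in transition_table("aaba", "ab"):
        print(row)
```

An entry equal to $r$ means that appending the character would complete an occurrence of $P$, so such transitions are the ones the dynamic program must skip.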

Second, the computation of function $max\sigma(i,j,k)$ is very time consuming and many repeated
computations are overlapped in the whole for loop of Algorithm 1. We can amortize the computation of
function $max\sigma(i,j,k)$ over the entries of $f(i,j,k)$ in the for loop on variable $k$ of Algorithm 1
and thereby reduce the time cost of the whole algorithm. The modified algorithm is described as
Algorithm 6.

Since $\lambda(k, x_i)$ can be computed in $O(1)$ time for each $x_i$, $1 \le i \le n$, and any $0 \le k < r$,
the loop body of the above algorithm requires only $O(1)$ time. Therefore, our new algorithm for computing
the length of an LCS of $X$ and $Y$ excluding $P$ as a substring requires $O(nmr)$ time and $O(r|\Sigma|)$
preprocessing time.
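The amortized scheme can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `str_ec_lcs_length` is hypothetical, $P$ is assumed non-empty, the alphabet is restricted to characters occurring in both inputs (only those can trigger a match), and transitions that would reach state $r$ are skipped explicitly.

```python
def str_ec_lcs_length(x, y, p):
    """Length of an LCS of x and y excluding p as a substring, in O(nmr) time."""
    n, m, r = len(x), len(y), len(p)
    if r == 0:
        raise ValueError("constraint string P must be non-empty")

    # Prefix function with sentinel kmp[0] = -1.
    kmp = [-1] * (r + 1)
    for i in range(1, r + 1):
        k = kmp[i - 1]
        while k >= 0 and p[k] != p[i - 1]:
            k = kmp[k]
        kmp[i] = k + 1

    # Transition table lam[t][a] = sigma(P[1:t] + a).
    alphabet = set(x) & set(y)
    lam = [{} for _ in range(r)]
    for a in alphabet:
        lam[0][a] = 1 if a == p[0] else 0
    for t in range(1, r):
        for a in alphabet:
            lam[t][a] = t + 1 if a == p[t] else lam[kmp[t]][a]

    f = [[[0] * r for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cur, up, left, diag = f[i][j], f[i - 1][j], f[i][j - 1], f[i - 1][j - 1]
            for k in range(r):
                cur[k] = max(up[k], left[k])
            if x[i - 1] == y[j - 1]:
                row_char = x[i - 1]
                for k in range(r):
                    t = lam[k][row_char]
                    # Push the value of state k forward to its successor state t.
                    if t < r and 1 + diag[k] > cur[t]:
                        cur[t] = 1 + diag[k]
    return max(f[n][m])


if __name__ == "__main__":
    print(str_ec_lcs_length("abbb", "aab", "ab"))
```

The key design choice is to iterate over source states $k$ and update the target entry $\lambda(k, x_i)$, instead of searching for the best source for each target, so each cell $(i,j)$ costs $O(r)$ in total.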

If we want to get an LCS of $X$ and $Y$ excluding $P$ as a substring, and not just its length, we can
also present a simple recursive backtracking algorithm for this purpose as the following Algorithm 7.

At the end of our new algorithm, we will find an index $t$ such that $f(n,m,t)$ gives the length of an LCS
of $X$ and $Y$ excluding $P$ as a substring. Then, a function call $back(n,m,t)$ will produce the answer LCS
accordingly.

Since the cost of the algorithm $max\sigma(i,j,k)$ is $O(r)$ in the worst case, the algorithm
$back(i,j,k)$ will cost $O(r \max(n,m))$.
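The backtracking idea can be sketched in Python as below. This is an iterative variant written for illustration, not a transcription of Algorithm 7: it fills the table with the amortized update, then walks back from $(n, m, t)$, preferring moves that keep the value unchanged and otherwise searching for a predecessor state, which plays the role of $max\sigma$. The name `str_ec_lcs` is hypothetical and $P$ is assumed non-empty.

```python
def str_ec_lcs(x, y, p):
    """Return one longest common subsequence of x and y that avoids p."""
    n, m, r = len(x), len(y), len(p)
    kmp = [-1] * (r + 1)
    for i in range(1, r + 1):
        k = kmp[i - 1]
        while k >= 0 and p[k] != p[i - 1]:
            k = kmp[k]
        kmp[i] = k + 1

    def step(k, ch):
        """sigma(P[1:k] + ch) for 0 <= k < r, using the prefix function."""
        while k >= 0 and p[k] != ch:
            k = kmp[k]
        return k + 1

    f = [[[0] * r for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cur = f[i][j]
            for k in range(r):
                cur[k] = max(f[i - 1][j][k], f[i][j - 1][k])
            if x[i - 1] == y[j - 1]:
                for q in range(r):
                    t = step(q, x[i - 1])
                    if t < r:
                        cur[t] = max(cur[t], 1 + f[i - 1][j - 1][q])

    # Walk back from the best final state.
    i, j = n, m
    k = max(range(r), key=lambda t: f[n][m][t])
    out = []
    while i > 0 and j > 0:
        v = f[i][j][k]
        if f[i - 1][j][k] == v:
            i -= 1
        elif f[i][j - 1][k] == v:
            j -= 1
        else:
            # The value was produced by matching x_i = y_j from some state q.
            q = next(q for q in range(r)
                     if step(q, x[i - 1]) == k and 1 + f[i - 1][j - 1][q] == v)
            out.append(x[i - 1])
            i, j, k = i - 1, j - 1, q
    return "".join(reversed(out))


if __name__ == "__main__":
    print(str_ec_lcs("abbb", "aab", "ab"))
```

The walk performs at most $n + m$ moves and each predecessor search inspects at most $r$ states.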

Finally we summarize our results in the following Theorem. 

Theorem 3 Algorithm 6 solves the STR-EC-LCS problem correctly in $O(nmr)$ time and $O(nmr)$ space,
with preprocessing time $O(r|\Sigma|)$.

5 Concluding Remarks 

We have suggested a new dynamic programming solution for the STR-EC-LCS problem. The new algorithm 
corrects a previously presented dynamic programming algorithm with the same time and space complexities. 

The STR-IC-LCS problem is another interesting generalized constrained longest common subsequence
(GC-LCS) problem which is very similar to the STR-EC-LCS problem.

The STR-IC-LCS problem, introduced in [1], is to find an LCS of two main sequences, in which a
constraining sequence of length $r$ must be included as its substring. In [1] an $O(nmr)$-time algorithm was
given for it. Almost immediately the presented algorithm was improved to a quadratic-time algorithm and
furthermore to many main input sequences [4].

It is not clear whether the same improvement can be applied to our presented $O(nmr)$-time algorithm
for the STR-EC-LCS problem to achieve a quadratic-time algorithm. We will investigate the problem further.

Algorithm 6 STR-EC-LCS

Input: Strings $X = x_1 \cdots x_n$, $Y = y_1 \cdots y_m$ of lengths $n$ and $m$, respectively, and a
constraint string $P = p_1 \cdots p_r$ of length $r$
Output: The length of an LCS of $X$ and $Y$ excluding $P$ as a substring

for all $i, j, k$, $0 \le i \le n$, $0 \le j \le m$, and $0 \le k \le r$ do
    $f(i,0,k) \leftarrow 0$, $f(0,j,k) \leftarrow 0$ {boundary condition}
end for
for $i = 1$ to $n$ do
    for $j = 1$ to $m$ do
        for $k = 0$ to $r$ do
            $f(i,j,k) \leftarrow \max\{f(i-1,j,k),\, f(i,j-1,k)\}$
        end for
        if $x_i = y_j$ then
            for $k = 0$ to $r - 1$ do
                $t \leftarrow \lambda(k, x_i)$
                $f(i,j,t) \leftarrow \max\{f(i,j,t),\, 1 + f(i-1,j-1,k)\}$
            end for
        end if
    end for
end for
return $\max_{0 \le t < r} \{f(n,m,t)\}$

Algorithm 7 $back(i,j,k)$

Comments: A recursive backtracking algorithm to construct the answer LCS

if $i = 0$ or $j = 0$ then
    return
end if
if $x_i = y_j$ then
    if $f(i,j,k) = f(i-1,j-1,k)$ then
        $back(i-1,j-1,k)$
    else
        $back(i-1,j-1,max\sigma(i,j,k))$
        print $x_i$
    end if
else if $f(i-1,j,k) > f(i,j-1,k)$ then
    $back(i-1,j,k)$
else
    $back(i,j-1,k)$
end if